perm filename PORP1[7,ALS] blob
sn#032374 filedate 1973-04-04 generic text, type T, neo UTF8
00010 April 3 1973
00020
00030 A Proposal for Speech Understanding Research
00040
00050
00060 It is proposed that the work on speech recognition that is
00070 now under way in the A.I. project at Stanford University be continued
00080 and extended as a separate project with broadened aims in the field
00090 of speech understanding. This work gives considerable promise both of
00100 solving some of the immediate problems that beset speech
00110 understanding research and of providing a basis for future advances.
00120
00130 It is further proposed that this work be more closely tied to
00140 the ARPA Speech Understanding Research groups than it has been in the
00150 past and that it have as its express aim the study and application to
00160 speech recognition of a machine learning process that has proved
00170 highly successful in another application and that has already been
00180 tested out to a limited extent in speech recognition. The machine
00190 learning process offers both an automatic training scheme and the
00200 inherent ability of the system to adapt to various speakers and
00210 dialects. Speech recognition via machine learning represents a global
00220 approach to the speech recognition problem and can be incorporated
00230 into a wide class of limited vocabulary systems.
00240
00250 Finally we would propose accepting responsibility for keeping
00260 other ARPA projects supplied with operating versions of the best
00270 current programs that we have developed. The availability of the high
00280 quality front end that the signature table approach provides would
00290 enable designers of the various over-all systems
00300 to test the relative performance of the top-down portions of their
00310 systems without having to make allowances for the deficiencies
00320 of their currently available front ends. Indeed, if the signature table
00330 scheme can be made simple enough to compete on a time basis (and we
00340 believe that it can) then it may replace the other front end
00350 schemes that are currently in favor.
00360
00370 Stanford University is well suited as the site for such work,
00380 having both the facilities for this work and a staff of people with
00390 experience and interest in machine learning, phonetic analysis, and
00400 digital signal processing.
00410
00420 Ultimately we would
00430 like to have a system capable of understanding speech from an
00440 unlimited domain of discourse and with unknown speakers. It seems not
00450 unreasonable to expect the system to deal with this situation very
00460 much as people do when they adapt their understanding processes to
00470 the speaker's idiosyncrasies during the conversation. The signature table
00480 method gives promise of contributing toward the solution of this
00490 problem as well as being a
00500 possible answer to some of the more immediate problems.
00510
00520 The initial thrust of the proposed work would be toward the
00530 development of adaptive learning techniques, using the signature
00540 table method and some more recent variants and extensions of this
00550 basic procedure. We have already demonstrated the usefulness of this
00560 method for the initial assignment of significant features to the
00570 acoustic signals. One of the next steps will be to extend the method
00580 to include acoustic-phonetic probabilities in the decision process.
00590 Ultimately we would hope to take account of syntactic and semantic
00600 constraints in a somewhat analogous fashion.
00610
00620 Still another aspect to be studied would be the amount of
00630 preprocessing that should be done and the desired balance between
00640 bottom-up and top-down approaches. It is fairly obvious that
00650 decisions of this sort should ideally be made dynamically, depending
00660 upon the familiarity of the system with the current domain of
00670 discourse and with the characteristics of the current speaker.
00680 Compromises will undoubtedly have to be made in any immediately
00690 realizable system but we should understand better than we now do the
00700 limitations on the system that such compromises impose.
00710
00720 It may be well at this point to describe the general
00730 philosophy that has been followed in the work that is currently under
00740 way and the results that have been achieved to date. We have been
00750 studying elements of a speech recognition system that is not
00760 dependent upon the use of a limited vocabulary and that can recognize
00770 continuous speech by a number of different speakers.
00780
00790 Such a system should be able to function successfully either
00800 without any previous training for the specific speaker in question or
00810 after a short training session in which the speaker would be asked to
00820 repeat certain phrases designed to train the system on those phonetic
00830 utterances that seemed to depart from the previously learned norm. In
00840 either case it is believed that some automatic or semi-automatic
00850 training system should be employed to acquire the data that is used
00860 for the identification of the phonetic information in the speech. We
00870 believe that this can best be done by employing a modification of the
00880 signature table scheme previously described. A brief review of this
00890 earlier form of signature table is given in Appendix 1.
00900
00910 The over-all system is envisioned as one that uses the more or
00920 less conventional method of separating the input speech into
00930 short time slices, for each of which some sort of frequency analysis
00940 (homomorphic, LPC, or the like) is done. We then interpret this
00950 information in terms of significant features by means of a set of
00960 signature tables. At this point we define longer sections of the
00970 speech called EVENTS which are obtained by grouping together varying
00980 numbers of the original slices on the basis of their similarity. This
00990 then takes the place of other forms of initial segmentation. Having
01000 identified a series of EVENTS in this way we next use another set of
01010 signature tables to extract information from the sequence of events
01020 and combine it with a limited amount of syntactic and semantic
01030 information to define a sequence of phonemes.
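
The grouping of slices into EVENTS can be illustrated with a minimal sketch. It is not the procedure used in the present system: the Euclidean distance measure, the fixed threshold, and the names distance and group_into_events are all assumptions made for illustration. Each slice is represented by a short vector of feature values, and a new event is started whenever a slice differs too much from the first slice of the current event.

```python
from math import sqrt


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_into_events(slices, threshold):
    """Group consecutive time slices into events by similarity.

    A slice joins the current event if its distance to the event's
    first slice is at most `threshold`; otherwise it opens a new event.
    Both the distance measure and the threshold rule are assumptions.
    Returns a list of (start, end) index pairs, with `end` exclusive.
    """
    events = []
    start = 0
    for i in range(1, len(slices)):
        if distance(slices[i], slices[start]) > threshold:
            events.append((start, i))
            start = i
    if slices:
        events.append((start, len(slices)))
    return events
```

Because events are defined by similarity rather than by a fixed duration, they contain varying numbers of slices, which is the property the text relies on.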
01040
01050 While it would be possible to extend this bottom-up approach
01060 still further, it seems reasonable to break off at this point and
01070 revert to a top-down approach from here on. The real difference in
01080 the over-all system would then be that the top-down analysis would
01090 deal with the outputs from the signature table section as its
01100 primitives rather than with the outputs from the initial measurements
01110 either in the time domain or in the frequency domain. In the case of
01120 inconsistencies the system could either refer to the second choices
01130 retained within the signature tables or if need be could always go
01140 clear back to the input parameters. The decision as to how far to
01150 carry the initial bottom-up analysis must depend upon the relative
01160 cost of this analysis, both in complexity and processing time, and the
01170 certainty with which it can be performed, as compared with the costs
01180 associated with the rest of the analysis and the certainty with which
01190 it can be performed, taking due notice of the costs in time of
01200 recovering from false starts.
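
The use of retained second choices can be sketched as a simple backtracking search. This is only an illustration under stated assumptions: the function resolve, the best-first candidate lists, and the consistency test supplied by the caller are hypothetical and stand in for whatever top-down analysis is actually used. A result of None corresponds to the case in which the system would have to go back to the input parameters.

```python
def resolve(ranked_choices, is_consistent):
    """Choose one label per segment, preferring first choices.

    ranked_choices: one list per segment, ordered best-first
        (first choice, second choice, ...), as a table might retain.
    is_consistent: hypothetical top-down check; takes the partial
        list of labels chosen so far and returns True or False.
    Returns the chosen labels, or None if no combination of the
    retained choices passes the check.
    """
    def search(i, chosen):
        if i == len(ranked_choices):
            return list(chosen)
        for label in ranked_choices[i]:
            chosen.append(label)
            if is_consistent(chosen):
                result = search(i + 1, chosen)
                if result is not None:
                    return result
            chosen.pop()  # false start: undo and try the next choice
        return None

    return search(0, [])
```

The time spent in the undo step is the cost of recovering from false starts mentioned above; the more reliable the first choices, the less often it is paid.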
01210
01220 Signature tables can be used to perform four essential
01230 functions that are required in the automatic recognition of speech.
01240 These functions are: (1) the elimination of superfluous and
01250 redundant information from the acoustic input stream, (2) the
01260 transformation of the remaining information from one coordinate
01270 system to a more phonetically meaningful coordinate system, (3) the
01280 mixing of acoustically derived data with syntactic, semantic and
01290 linguistic information to obtain the desired recognition, and (4) the
01300 introduction of a learning mechanism.
01310
01320 The following three advantages emerge from this method of
01330 training and evaluation.
01340 1) Essentially arbitrary inter-relationships between the
01350 input terms are taken into account by any one table. The only loss of
01360 accuracy is in the quantization.
01370 2) The training is a very simple process of accumulating
01380 counts. The training samples are introduced sequentially, and hence
01390 simultaneous storage of all the samples is not required.
01400 3) The process linearizes the storage requirements in the
01410 parameter space.
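
Points 1) and 2) can be made concrete with a minimal sketch of a single table trained by accumulating counts. It is not the system of Appendix 2: the class CountTable, the uniform quantization, and the per-category counts are assumptions chosen for illustration. The quantized inputs jointly select one cell, so any inter-relationship between them is captured up to the quantization, and each training sample is used once and then discarded.

```python
from collections import defaultdict


def quantize(value, lo, hi, levels):
    """Map a value in [lo, hi] to an integer level 0..levels-1 (clamped)."""
    if value <= lo:
        return 0
    if value >= hi:
        return levels - 1
    return min(levels - 1, int((value - lo) / (hi - lo) * levels))


class CountTable:
    """One table: quantized inputs select a cell holding category counts."""

    def __init__(self, ranges, levels):
        self.ranges = ranges    # (lo, hi) for each input term
        self.levels = levels    # quantization levels per input
        self.counts = defaultdict(lambda: defaultdict(int))

    def signature(self, inputs):
        """Tuple of quantized inputs used as the cell address."""
        return tuple(quantize(v, lo, hi, self.levels)
                     for v, (lo, hi) in zip(inputs, self.ranges))

    def train(self, inputs, category):
        """Training is only the accumulation of a count."""
        self.counts[self.signature(inputs)][category] += 1

    def ranked(self, inputs):
        """Categories seen in this cell, most frequent first."""
        cell = self.counts.get(self.signature(inputs), {})
        return sorted(cell, key=lambda c: (-cell[c], c))
```

The ranked output keeps the second and later choices available, and further calls to train with a new speaker's samples adjust the counts without any separate retraining pass.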
01420
01430 The signature tables, as used in speech recognition, must be
01440 particularized to allow for the multi-category nature of the output.
01450 Several forms of tables have been investigated. Details of the current
01460 system are given in Appendix 2. Some results are summarized in an
01470 attached report.
01480
01490 Work is currently under way on a major refinement of the
01500 signature table approach which adopts a somewhat more rigorous
01510 procedure. Preliminary results with this scheme indicate that a
01520 substantial improvement has been achieved.